Depending on how you access our project, some figures may not be displayed properly:
In this project, we perform in-depth EDA(Exploratory Data Analysis) on a collection of data sets about COVID-19 in South Korea. Our goal is to use these data to better understand COVID and learn how to prevent it effectively to save lives.
We answered the following questions by performing in-depth data visualization, and regression analysis.
Our project utilizes the following libraries, please make sure they are installed by using the install.packages() command.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(gganimate)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(htmltools)
library(leaflet)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
There are many COVID datasets available on Kaggle. However, most of them only have information on date and the number of confirmed cases. This South Korea COVID dataset, on the other hand, contains comprehensive information on cases, patient info, weather, geographical location, contact tracing, etc. Therefore, it is ideal for this project.
The data is collected from KCDC (Korea Centers for Disease Control & Prevention). The data set is available on Kaggle. The following code snippet loads the dataset into R workspace, the best way to view the data is through R.
# Read Datasets, treat empty string value as NA
case <- read.csv('./Data/Case.csv', na.strings = c("", "NA"))
patient <- read.csv('./Data/PatientInfo.csv', na.strings = c("", "NA"))
policy <- read.csv('./Data/Policy.csv', na.strings = c("", "NA"))
region <- read.csv('./Data/Region.csv', na.strings = c("", "NA"))
searchtrend <- read.csv('./Data/SearchTrend.csv', na.strings = c("", "NA"))
seoulfloating <- read.csv('./Data/SeoulFloating.csv', na.strings = c("", "NA"))
time <- read.csv('./Data/Time.csv', na.strings = c("", "NA"))
timeage <- read.csv('./Data/TimeAge.csv', na.strings = c("", "NA"))
timegender <- read.csv('./Data/TimeGender.csv', na.strings = c("", "NA"))
timeprovince <- read.csv('./Data/TimeProvince.csv', na.strings = c("", "NA"))
weather <- read.csv('./Data/Weather.csv', na.strings = c("", "NA"))
route <- read.csv('./Data/routes.csv', na.strings = c("", "NA"))
patient_1 <- read.csv('./Data/patients.csv', na.strings = c("", "NA"))
# Filter out empty value for confirmed date
patient_dates <- patient_1 %>%
filter(!is.na(confirmed_date))
# Plot confirmed cases for each day
patient_dates %>%
group_by(confirmed_date) %>%
summarise(n = n()) %>%
ggplot(aes(as.Date(confirmed_date, format = "%m/%d/%Y"), n))+
geom_line(size = 2, alpha = 0.8, col="red")+
geom_point(size = 5, alpha = 1, col="purple")+
labs(title = "Daily Confirmed Cases of COVID-19 in South Korea by Date",
subtitle = paste0("Total Confirmed Cases: ", nrow(patient)),
caption = "Green Line Indicates the Trend") +
scale_x_date(date_labels = "%b %d", date_breaks = "2 days") +
theme_light() +
theme(plot.title = element_text(face = "bold",
hjust = 0.5, size = 18, color = "black"),
plot.subtitle = element_text(face = "bold",
size = 12, color = "blue")) +
ggrepel::geom_label_repel(aes(label=n), hjust=1.5, vjust=0.9,
fill ="orange") +
geom_smooth(method = "gam", se = TRUE, alpha = 0.5, col = "green")
Analysis:
As shown in the figure, marked by the green line, we can see that there is a huge surge from late Feburary to Early March in South Korea.
# Filter out empty value for confirmed date
time_data <- time
# Plot confirmed cases for each day
ggplot () +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = confirmed), color = "blue") +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = deceased), color = "red") +
geom_point(data = time_data, aes(x = as.Date(date, format = "%Y-%m-%d"), y = released), color = "green") +
scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
labs(title = "Total Confirmed, Deased, Released Cases by Date",
caption = "Blue: Confirmed, Red: Deceased, Green: Released")
Analysis:
As shown in the figure, we can see that the total deceased cases trend is relatively flat, while both the total confirmed and total released cases has a huge surge around late Feburary. This matches the observation from previous fiture, where there is a huge surge in confirmed cases around late Feburary to early March.
# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year
# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age <- patient_age %>%
select(age_group) %>%
na.omit() %>%
group_by(age_group) %>%
summarise(confirmed_cases = n())
ggplot(patient_age, aes(y = age_group, x = confirmed_cases, fill = age_group)) +
geom_bar(stat="identity") +
xlab("Confirmed Cases") +
ylab("Age Group") +
ggtitle(label = "Confirmed Cases by Age Group")
Analysis:
As shown in the figure, the most affected age group is 20-30, the second most affected group is 50-60.
patient_age <- patient
patient_age$contact_number[patient_age$contact_number == "-"] = NA
patient_age_c = patient_age[!is.na(patient_age$contact_number),]
patient_age_c$contact_number = as.numeric(patient_age_c$contact_number)
patient_age = aggregate(patient_age_c$contact_number, by = list(Category = patient_age_c$age), FUN = mean)
ggplot(patient_age, aes(y = Category, x = x, fill = Category)) +
geom_bar(stat="identity") +
xlab("Contact Number") +
ylab("Age Group") +
ggtitle(label = "Average Contact Number by Age Group")
Analysis:
As shown in the figure, On average, people in 20s-50s, and 70s have contact number above 30.
# Calculate Age for each Confirmed Case
patient_age <- patient_1
patient_age$age = 2020 - patient_1$birth_year
# Calculate Age Groups for Confirmed Cases by 10
patient_age$age_group = cut(patient_age$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age = patient_age[!is.na(patient_age$deceased_date),]
patient_age <- patient_age %>%
select(age_group) %>%
na.omit() %>%
group_by(age_group) %>%
summarise(deceased_cases = n())
ggplot(patient_age, aes(y = age_group, x = deceased_cases, fill = age_group)) +
geom_bar(stat="identity") +
xlab("deceased Cases") +
ylab("Age Group") +
ggtitle(label = "Deceased Cases by Age Group")
Analysis:
As shown in the figure, the most affected age group is 80-90.
# Group number of patients by province
patient_province <- patient
patient_province <- patient_province %>%
select(province) %>%
na.omit()%>%
group_by(province)%>%
summarise(n=n())
ggplot(patient_province, aes(y=n, x=reorder(province,n), fill=province)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Province Name") +
ylab("Confirmed Cases") +
ggtitle(label = "Confirmed Cases in Each Province")
Analysis:
As shown in the figure, Seoul is the most affected province, with Gyeongsangbuk-do and Gyeonggi-do close to Seoul.
patient_gender <- patient
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
ggplot(patient_gender,
aes(x="",y=patient_num,fill=sex)) +
geom_bar(width=1, stat="identity",color="white") +
ggtitle(label = "Confirmed Cases Gender Ratio") +
coord_polar("y",start=0) +
geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
position=position_stack(vjust=0.5)) +
theme_void() +
theme(plot.title = element_text(size = 30, face = "bold"))
Analysis:
As shown in the figure, COVID affects females more than male.
patient_gender <- patient
patient_gender = patient_gender[!is.na(patient_gender$deceased_date),]
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
ggplot(patient_gender,
aes(x="",y=patient_num,fill=sex)) +
geom_bar(width=1, stat="identity",color="white") +
ggtitle(label = "Deceased Gender Ratio") +
coord_polar("y",start=0) +
geom_text(aes(label=paste0(round(patient_num/ sum(patient_num)*100,1),"%")),
position=position_stack(vjust=0.5)) +
theme_void() +
theme(plot.title = element_text(size = 30, face = "bold"))
Analysis:
As shown in the figure, male has a much higher mortality rate than female.
patient_gender <- patient
patient_gender_d = patient_gender[!is.na(patient_gender$deceased_date),]
patient_gender <- patient_gender %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
patient_gender_d <- patient_gender_d %>%
select(sex) %>%
na.omit() %>%
group_by(sex) %>%
summarise(patient_num = n())
patient_gender_d[2] = patient_gender_d[2]/ patient_gender[2] * 100
ggplot(patient_gender_d, aes(x = sex, y = patient_num, fill = sex)) +
geom_bar(stat="identity") +
xlab("mortality rate") +
ylab("Gender") +
ggtitle(label = "Mortality Rate by Gender")
patient_age_province <- patient_1
patient_age_province$age = 2020 - patient_1$birth_year
patient_age_province$age_group = cut(patient_age_province$age,
breaks = seq(0,90, by=10),
right = FALSE,
labels = c("0-10","10-20","20-30",
"30-40","40-50","50-60",
"60-70","70-80","80-90"))
patient_age_province %>%
select(age_group, province) %>%
na.omit() %>%
group_by(age_group, province) %>%
summarise(Count = n()) %>%
ggplot() +
geom_tile(aes(x = age_group, y = province, fill = Count)) +
geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
scale_fill_gradientn(colours = c("green","red"),
values = c(0,0.1,1)) +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
theme_classic() +
theme(text = element_text(size = 15, face = "bold"),
axis.text.x = element_text(angle = 90),
legend.position = "top",
legend.key.width = unit(3,"cm"))
Analysis:
As shown in the figure, the most affected group is age from 30-40, living in capital area.
patient_age_province_gender <- patient_age_province
patient_age_province_gender %>%
select(sex, age_group, province) %>%
na.omit() %>%
group_by(sex, age_group, province) %>%
summarise(Count = n()) %>%
ggplot() +
geom_tile(aes(x = age_group, y = province, fill = Count)) +
geom_vline(xintercept = seq(0.5,9.5,by = 1), linetype = 'dashed') +
geom_hline(yintercept = seq(0.5,14.5,by = 1), linetype = 'dashed') +
scale_fill_gradientn(colours = c("green","red"),
values = c(0,0.1,1)) +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
theme_classic() +
theme(text = element_text(size = 15, face = "bold"),
axis.text.x = element_text(angle = 90),
legend.position = "top",
legend.key.width = unit(3,"cm")) +
facet_wrap(~sex)
Analysis:
As shown in the figure, the most affected group is age from 30-40, male, living in capital area.
case$longitude <- as.numeric(as.character(case$longitude))
## Warning: NAs introduced by coercion
case$latitude <- as.numeric(as.character(case$latitude))
## Warning: NAs introduced by coercion
# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
zoom = 7, scale = 4,
maptype = 'roadmap',
color = 'color')
figure <- ggmap(map) +
theme_void() +
ggtitle("Number of Confrimed Cases by Case ID in South Korea") +
scale_color_gradientn(colors = rainbow(5)) +
theme(plot.title = element_text(color = "blue"),
panel.border = element_rect(color = "grey", fill = NA, size = 2))
figure + geom_point(data = case,
aes(longitude, latitude, size= confirmed,
color=confirmed, group=confirmed))
## Warning: Removed 109 rows containing missing values (geom_point).
Analysis:
As shown in the figure, the area around Seoul have a large number of small infection events. There is a massive infection event in the city of Nam-gu, Daegu province. This massive infection event is at Shincheonji Church.
# Get a data frame that contains both time and coordinates
case_bak <- subset(case, select = c(province, longitude, latitude))
geo_time <- merge(case_bak, timeprovince, by="province")
geo_time$date <- as.Date(geo_time$date, format = "%Y-%m-%d")
# Set Google API key
register_google(key = "AIzaSyCiHpBw6DsGxXs6pgfgQTd2tP4wAvzrmJI")
# Get Map from Google Maps API
map = get_googlemap(center = c(lon = 127.7669, lat = 35.9087),
zoom = 7, scale = 4,
maptype = 'roadmap',
color = 'color')
figure <- ggmap(map) +
theme_void() +
ggtitle("Number of Confrimed Cases by Case ID in South Korea Over Time") +
scale_color_gradientn(colors = rainbow(5)) +
theme(plot.title = element_text(color = "blue"),
panel.border = element_rect(color = "grey", fill = NA, size = 2))
figure + geom_point(data = geo_time,
aes(longitude, latitude, size= confirmed,
color=confirmed, group=confirmed)) +
transition_time(date)
Analysis:
As shown in the figure, The two COVID center is around Seoul and Daegu province.
# Known Group Infection Events
group_infec <- tibble(lon = c(127.0518049,126.9165592,127.1217472,128.7368285,129.0771404,128.4941047,128.5707814,128.5665743,128.9099632),
lat = c(37.2367708,37.6338843,37.3881528,35.6486172,35.2159947,36.0579173,35.8507155,35.8397168,36.9273528),
address = c('650-103 mangpo-dong, yeongtong-gu, suwon, gyeonggi-do, south korea','1021 tongil-ro, jingwan-dong, eunpyeong-gu, seoul, south korea','20 seohyeon-ro 180(baekpalsip)beo, seohyeon-dong, bundang-gu, seongnam-si, gyeonggi-do, south korea','100-1 beomgok-ri, hwayang-eup, cheongdo, gyeongsangbuk-do, south korea','436-8 oncheon 1(il)-dong, dongnae-gu, busan, south korea','918 yuhak-ro, gasan-myeon, chilgok-gun, gyeongsangbuk-do, south korea','72-10 seongdang-dong, dalseo-gu, daegu, south korea','81 daemyeong-ro, daemyeong 10(sip)-dong, nam-gu, daegu, south korea','916 soro-ri, chunyang-myeon, bonghwa-gun, gyeongsangbuk-do, south korea'),
korean_name = c('수원 생명샘교회','은평 성모병원','분당 제생병원','청도 대남병원','부산 온천교회','칠곡 밀알 사랑의집','대구 한마음아파트','대구 신천지 교회','봉화 푸른 요양원'),
english_name = c('Soowon Saengmyungsame Church','Eunpyung Sungmo Hospital','Boondang Jesaeng Hospital','Chungdo Daenam Hospital','Busan Onchun Church','Chilgok Milal Sarang House','Daegu Hanmaeum Apartment','Daegu Sinchunji Church','Bonghwa Pureun Clinic'),
n_of_infections= c(10,14,11,118,34,22,46,4482,51))
# Pre-processing
group_infec_2 <- group_infec %>% select(lon,lat,korean_name,english_name,n_of_infections) %>%
rename(latitude2 = lat, longitude2 = lon, id = english_name) %>%
as.data.table()
group_infec_3 <- group_infec_2[, list(label=HTML(
paste(sep = "<br>",
paste("G.I(eng): ", id),
paste("G.I(kor): ", korean_name),
paste("Num of infects:", n_of_infections)))),
list(id, longitude2, latitude2)]
#Combining route and patient and pre-processing
agg_1 <- route %>% left_join(patient_1, by="global_id") %>%
mutate(infected_by = ifelse(is.na(infected_by),"Unk",infected_by)) %>%
as.data.table
agg_2 <- agg_1[, list(label=HTML(
paste(sep="<br>",
paste("Patient Id :", global_id),
paste("Sex :", sex),
paste("Country :", country),
paste("Disease :", disease),
paste("Group:", group),
paste("Infection_reason:", infection_reason),
paste("Infection_order:", infection_order),
paste("Infected_by:", infected_by),
paste("Contact_number:", contact_number),
paste("Confirmed_date:", confirmed_date),
paste("State:", state)))),
list(global_id, longitude, latitude)]
listed <- list(agg_2, group_infec_3)
agg_3 <- rbindlist(listed, use.names = TRUE, fill = TRUE)
agg_3 %>%
leaflet() %>%
setView(128.5, 36.5, zoom = 7) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
#addProviderTiles : Stamen.Toner, CartoDB.Positron, Esri.NatGeoWorldMap, MtbMap
addCircleMarkers(
lat = ~ latitude,
lng = ~ longitude,
label = ~label,
clusterOptions = markerClusterOptions(),
labelOptions = labelOptions(noHide = F,
direction = "left",
style = list(
"color" = "black",
"box-shadow" = "4px 4px rgba(0,0,0,0.25)",
"font-size" = "13px",
"border-color" = "rgba(0.5,3,0.5,0.5)"))) %>%
#Addding group infections
addMarkers(
lat = ~ latitude2,
lng = ~ longitude2,
label = ~label,
labelOptions = labelOptions(noHide = F,
direction = "right",
style = list(
"color" = "red",
"font-size" = "13px")))
## Warning in validateCoords(lng, lat, funcName): Data contains 338 rows with
## either missing or invalid lat/lon values and will be ignored
## Warning in validateCoords(lng, lat, funcName): Data contains 2050 rows with
## either missing or invalid lat/lon values and will be ignored
time_case <- read.csv("./Data/TimeProvince.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key = "Cumulative_case",value="number", -c(1:3)) %>%
group_by(date, Cumulative_case) %>% summarise(cum_number = sum(number))
search_trend <- read.csv("./Data/SearchTrend.csv", header = TRUE, stringsAsFactor =FALSE) %>%
gather(key="Search_word", value="SearchTrend", -date) %>%
filter(date >= "2020-01-20")
search_and_case <- search_trend %>% left_join(time_case, by = 'date')
options(repr.plot.width = 20, repr.plot.height = 10)
search_and_case %>%
ggplot(aes(as.Date(date, format = "%Y-%m-%d"), cum_number, group = Cumulative_case, color= Cumulative_case)) +
geom_point() +
geom_line(linetype = 'dashed') +
theme(axis.text.x = element_text(angle = 75, hjust = 1),
legend.text=element_text(size=20),
legend.title=element_text(size=20),
axis.title.x = element_text(color="black", size=20, face="bold"),
axis.title.y = element_text(color="black", size=20, face="bold")) +
geom_point(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word, shape=Search_word), color='black', size=3) +
geom_line(aes(as.Date(date, format = "%Y-%m-%d"), SearchTrend, group=Search_word), color='black') +
scale_y_log10() +
scale_x_date(date_labels = "%b %d", date_breaks = "30 days") +
ylab("Search Volume or Cumulative Cases in log scale") +
ggtitle("Growth of COVID-19 cases and Web Search in South Korea")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
Analysis:
As shown in the figure, the search for “cold”, “coronavirus”, “flu” and “pneumonia” skyrocketed since the beginign of the pandemic, then they started to die down as time went by.
During the early stage of COVID, there is a rumor saying that COVID will go away as temperature warms up. We are interested in finding if there is a correlation between temperature and confirmed COVID cases. We performed a simple linear regression on temperature and confirmed cases for the period between 1/20 - 3/20.
w <- weather %>% group_by(date) %>% summarize(mean(avg_temp))
colnames(w) <- c("date","temperature")
t <- time %>% select(date,confirmed)
wt <- merge(w, t, by="date")
fit <- lm(confirmed ~ temperature, data=wt)
summary(fit)
##
## Call:
## lm(formula = confirmed ~ temperature, data = wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6752.2 -1919.3 90.2 2364.6 5350.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1140.75 446.24 2.556 0.0115 *
## temperature 528.60 30.87 17.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2780 on 160 degrees of freedom
## Multiple R-squared: 0.647, Adjusted R-squared: 0.6448
## F-statistic: 293.2 on 1 and 160 DF, p-value: < 2.2e-16
Analysis: As the summary shows, the p-value is very small (<2.2e-16), we can reject the null hypothesis. Therefore, there is a relationship between temperature and the number of confirmed cases.
wt <- head(wt, 61)
ggplot(wt, aes(x=temperature, y=confirmed)) +
geom_point(aes(color=date)) +
stat_smooth(method='lm') +
labs(title = "Temperature and Number of Confirmed Cases", x = "Temp", y= "Case")
Analysis:
We select data from 1-20 to 3-20 for clear visualization. A simple linear regression regression does not fit the data. The relationship cannot be explained linearly.